tooling(scripts): add per-template sweep classifiers (#187/#190/#192/#193) #194
Open
hyperpolymath wants to merge 2 commits into
…-workflow campaign

Durable tooling for the wrapper-sweep work that follows each of the foundational reusable PRs (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

Each classifier:
- reads a paginated `gh api /search/code` JSON dump
- fetches each unique blob SHA exactly once (cached in `$BLOBS_DIR`)
- emits per-repo TSV: `<repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>`

Classes vary per template but follow the same shape:
- TRIVIAL (canonical match, mechanical wrapper)
- SLIM/MISSING/OLDER (propagation lag, auto-upgrades on first run after wrapper merge)
- NEEDS_REVIEW (custom workflow body, requires per-repo diff)

Numbers produced by these classifiers across the four campaign templates:
- mirror.yml: 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner: 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (standards itself)
- codeql: 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan: 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

README documents the path-filter caveat: `gh api /search/code` with `path:.github/workflows` excludes monorepo-nested workflow files; the broader `filename:` query (no path filter) catches them. For hypatia-scan, the broader query returns 704 results vs the 255 path-filtered count. The ~449 nested copies also need wrappers when sweeps fire.
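The "fetch each unique blob SHA exactly once" step can be sketched roughly as below. This is a minimal illustration, not the code in the PR: the function names `fetch_blob` and `cached_blob`, the `FETCHER` override, and the default cache location are all hypothetical. The default fetcher assumes GitHub's Git Blobs endpoint, which returns base64-encoded content.

```shell
# Sketch only: cache blob bodies by SHA so that identical files shared by
# many repos are downloaded once. All names here are illustrative.
BLOBS_DIR="${BLOBS_DIR:-${TMPDIR:-/tmp}/sweep-blobs}"

# Hypothetical default fetcher: GET /repos/{owner}/{repo}/git/blobs/{sha}
# returns the body base64-encoded in the "content" field.
fetch_blob() {
  gh api "repos/$1/git/blobs/$2" --jq '.content' | base64 -d
}

# cached_blob REPO SHA -> prints the path of the cached blob body.
# Set FETCHER to any command taking "REPO SHA" to replace the default.
cached_blob() {
  local repo="$1" sha="$2" out
  out="${BLOBS_DIR}/${sha}"
  mkdir -p "$BLOBS_DIR"
  if [ ! -f "$out" ]; then
    # Write to a temp file first so a failed fetch leaves no cache entry.
    "${FETCHER:-fetch_blob}" "$repo" "$sha" > "${out}.tmp" \
      || { rm -f "${out}.tmp"; return 1; }
    mv "${out}.tmp" "$out"
  fi
  printf '%s\n' "$out"
}
```

Keying the cache on the SHA alone (not on the repo) is what makes the dominant clusters cheap: a SHA shared by many repos costs one API call.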
Same as #192 (codeql-reusable) — auto-merge enabled but zero workflow runs against the head commit. Pushing empty commit to re-trigger CI.
hyperpolymath added a commit that referenced this pull request on May 26, 2026
…ergence set (#205)

## Summary

5th and final reusable in the workflow convergence campaign (see #199 for the meta-doc). Consolidates the per-repo `scorecard.yml` workflow.

## Drift signal (full pagination + per-repo verified)

- **258** top-level estate deployments
- **626** nested copies in monorepos (asdf-tool-plugins, developer-ecosystem, ssg-collection, standards, ambientops, julia-ecosystem, etc.; Layer-2 truncation discovery via #204's helper)
- **46** unique blob SHAs / 17.8% structural drift
- Top SHA covers **100/258 (38.8%)**, the highest dominant cluster of the 5 campaigns
- Top 7 SHAs cover ~80%
- **100% mechanical drift, ZERO feature variance**: SPDX header (PMPL-1.0 / PMPL-1.0-or-later / MPL-2.0), `upload-sarif` SHA-pin churn, `permissions: read-all` vs `contents: read` wording

## Design

- One input: `runs-on` (default ubuntu-latest)
- No `secrets: inherit`: Scorecard uses `GITHUB_TOKEN` directly
- Caller MUST grant `security-events: write` + `id-token: write` on the calling job (called-workflow permissions are capped by caller)
- Caller keeps own `on:` triggers + `concurrency:` group

## Per Layer-3 caveat from the campaign meta-doc

Nested workflows are inert: GitHub Actions only runs `.github/workflows/` at the repo root. Sweeping the 626 nested copies is single-source-of-truth cleanup, not security hardening.

## Campaign convergence set (closes with this PR)

| PR | Template |
|---|---|
| #187 | mirror-reusable.yml |
| #190 | secret-scanner-reusable.yml |
| #192 | codeql-reusable.yml |
| #193 | hypatia-scan-reusable.yml |
| #194 | sweep-classifier scripts |
| #199 | campaign meta-doc |
| #204 | list-workflow-paths.sh (bypass /search/code undercount) |
| **this** | **scorecard-reusable.yml** |

## Test plan

- [ ] Wrapper sweep (~258 top-level + ~626 nested): owner-gated; not part of this PR
- [ ] Update classify-* scripts to consume helper TSV (follow-up)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
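The drift-signal figures above (unique blob SHAs, structural drift, top-SHA coverage) can be recomputed from any TSV that carries the blob SHA in a known column. The sketch below is an illustration under assumptions: `drift_stats` is a hypothetical name, the SHA is assumed to sit in column 2 as in the six-column classifier output, and "structural drift" is assumed to mean unique SHAs divided by total rows, which is consistent with 46/258 ≈ 17.8%.

```shell
# Sketch only: summarise SHA drift from a TSV with the blob SHA in column 2.
# drift_stats FILE -> "unique=<n> total=<n> drift=<pct>% top=<n> top_share=<pct>%"
drift_stats() {
  awk -F '\t' '
    { count[$2]++; total++ }
    END {
      if (total == 0) { print "unique=0 total=0"; exit }
      for (s in count) { unique++; if (count[s] > top) top = count[s] }
      printf "unique=%d total=%d drift=%.1f%% top=%d top_share=%.1f%%\n",
             unique, total, 100 * unique / total, top, 100 * top / total
    }' "$1"
}
```

A high `top_share` with a low `drift` is the signal that one wrapper can cover most deployments mechanically.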
hyperpolymath added a commit that referenced this pull request on May 26, 2026
…consumers (#204)

## Summary

Two-commit change adding nested-path support to the sweep-classifier pipeline:

1. **`scripts/sweep-classifiers/list-workflow-paths.sh`**: walks `gh repo list` and queries each repo's Git Tree API directly. Bypasses two compounding undercounts in `gh api /search/code`.
2. **All 4 `classify-*.sh` scripts updated** to consume the helper's TSV output and emit the sweep-target path as an explicit column.

## Why the helper exists: 3 layers of undercount

1. **Layer 1, path-prefix filter:** `path:.github/workflows` matches the path PREFIX, excluding nested `<subdir>/.github/workflows/<file>.yml` paths outright.
2. **Layer 2, org-scope truncation:** even broad `filename:<file>.yml org:<org>` queries hit internal caps. Validated against `scorecard.yml`: broad query saw 152 paths (all flagged top-level); per-repo enumeration found **626 additional nested copies** the broad query missed entirely.
3. **Layer 3, nested workflows are inert:** GitHub Actions only runs `.github/workflows/` at the repo root. Nested copies are vendored templates / stale leftover. Security campaigns gain nothing from sweeping nested copies; single-source-of-truth campaigns still benefit.

## Helper output

TSV, one row per matching workflow file:

```
<repo>\t<path>\t<blob-sha>\t<top-level|nested>
```

Cost: one Git Tree API call per repo (~300 calls), uses `core` bucket (5000/hr) not throttled `code_search` (10/min).

## Classifier extensions

Each `classify-*.sh` now auto-detects input format from the first byte:
- `{` → JSONL from `gh /search/code` (legacy path)
- otherwise → TSV from `list-workflow-paths.sh` (preferred; handles nested)

Output is unified to 7 columns: `repo \t path \t sha \t class \t reason \t lines \t details`. The new `path` column carries the file's location inside the repo, so sweeps can target nested copies as first-class wrapper sites.

Shared `normalize_input` extracted into `_lib.sh`; each classifier sources it.
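The first-byte format detection described above might look something like the following. This is a hedged sketch, not the actual `normalize_input` from `_lib.sh`; the function name `detect_format` and its output labels are assumptions made for illustration.

```shell
# Sketch only: decide the input format from the first byte of the file.
# detect_format FILE -> prints "jsonl" if the file starts with "{", else "tsv"
detect_format() {
  local first
  first="$(head -c 1 "$1")"
  if [ "$first" = "{" ]; then
    echo jsonl   # legacy dump from gh /search/code
  else
    echo tsv     # rows from list-workflow-paths.sh
  fi
}
```

Sniffing a single byte works here because a TSV row begins with a repo name, which cannot start with `{`; an empty file falls through to the TSV branch.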
## Validation

Smoke-tested both input paths:
- TSV (helper): classify-mirror.sh on scorecard-tuples.tsv (287 repos × top-level + nested); fetches blobs and emits per-(repo, path) rows.
- JSONL (legacy): classify-mirror.sh on mirror-full.json gives 267 TRIVIAL + 22 NEEDS_REVIEW, matching prior `/tmp/drift-survey/sweep-report.md`.

## Stacked on #194

`scripts/sweep-classifiers/` only exists once #194 merges. The diff against `main` includes #194's files transitively; once #194 lands, this PR narrows to just the helper + extensions.

## Standing follow-ups

- Once this lands, re-survey each candidate with the helper for ground-truth wrapper-site counts before firing any sweep.

🤖 Generated with [Claude Code](https://claude.com/claude-code)
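A validation check like the JSONL one above amounts to counting classes in the classifier output and comparing against an earlier report. A minimal sketch, assuming the unified 7-column output with the class in column 4 (`class_tally` is a hypothetical helper, not part of the PR):

```shell
# Sketch only: tally the class column (column 4) of the 7-column output.
# class_tally FILE -> one "<class>\t<count>\t<pct>%" line per class, sorted
class_tally() {
  awk -F '\t' '
    { n[$4]++; total++ }
    END { for (c in n) printf "%s\t%d\t%.1f%%\n", c, n[c], 100 * n[c] / total }
  ' "$1" | LC_ALL=C sort
}
```

For the mirror run quoted above, a tally of this kind would be expected to report 267 TRIVIAL and 22 NEEDS_REVIEW rows.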
## Summary

Durable tooling for the wrapper-sweep work that follows each of the four foundational reusable PRs filed today (#187 mirror, #190 secret-scanner, #192 codeql, #193 hypatia-scan).

## Adds

`scripts/sweep-classifiers/`:
- `classify-mirror.sh`: for #187 (feat(governance): add mirror-reusable.yml - consolidate 289-repo mirror.yml drift)
- `classify-secret-scanner.sh`: for #190 (feat(governance): add secret-scanner-reusable.yml - propagate shell-secrets to 281 repos)
- `classify-codeql.sh`: for #192 (feat(governance): add codeql-reusable.yml - consolidate 263-repo codeql.yml drift)
- `classify-hypatia-scan.sh`: for #193 (feat(governance): add hypatia-scan-reusable.yml - biggest LOC leverage of the reusable trilogy)
- `README.adoc`: usage + nested-path caveat

## What each classifier does

- Reads a paginated `gh api /search/code` JSON dump for the template
- Fetches each unique blob SHA exactly once (cached in `$BLOBS_DIR`)
- Emits per-repo TSV: `<repo>\t<sha>\t<class>\t<reason>\t<lines>\t<details>`

## Numbers produced across the four campaign templates

- mirror.yml: 267/289 TRIVIAL (92.4%); 22 NEEDS_REVIEW
- secret-scanner: 273/281 missing shell-secrets (97.2%); 1 TRIVIAL (only the `standards` repo carries `shell-secrets` today)
- codeql: 246/263 mechanical (93.5%); 17 NEEDS_REVIEW
- hypatia-scan: 249/255 safe-to-standardize-up (97.6%); 6 NEEDS_REVIEW

## Nested-path caveat (documented in README)

`gh api /search/code` with `path:.github/workflows` matches the path PREFIX, so monorepo nested workflow files (e.g., `a2ml/bindings/deno/.github/workflows/hypatia-scan.yml`) are EXCLUDED.

Verified for hypatia-scan: the broader query without `path:` returns 704 results vs 255 path-filtered. The same effect likely applies to the other three templates; sweep tooling must walk all `**/.github/workflows/<template>.yml` paths.

## Pattern

Same shape as `scripts/apply-baseline.sh` (paired with `scripts/tests/apply-baseline-test.sh`): committed durable tooling rather than ephemeral `/tmp` scripts.

🤖 Generated with Claude Code
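To make the nested-path caveat concrete, the distinction between a root-level workflow and a monorepo-nested copy can be expressed as a path check. The sketch below is illustrative only; `classify_path` is a hypothetical name and the `other` label is an assumption for paths that are not workflow files at all.

```shell
# Sketch only: label a repo-relative path as a root workflow, a nested
# copy, or neither. In a case pattern, "*" also matches "/" characters.
classify_path() {
  case "$1" in
    .github/workflows/*)   echo top-level ;;  # matched by path:.github/workflows
    */.github/workflows/*) echo nested ;;     # missed by the prefix filter
    *)                     echo other ;;
  esac
}
```

With this split, `a2ml/bindings/deno/.github/workflows/hypatia-scan.yml` is labelled `nested`, which is exactly the kind of path the `path:`-filtered search leaves out.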